
Author Search Result

[Author] Nozomu TOGAWA (97 hits)

Showing results 81-97 of 97.

  • A SIMD Instruction Set and Functional Unit Synthesis Algorithm with SIMD Operation Decomposition

    Nozomu TOGAWA  Koichi TACHIKAKE  Yuichiro MIYAOKA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-Programmable Logic, VLSI, CAD and Layout

    Vol: E88-D No:7  Page(s): 1340-1349

    This paper focuses on SIMD processor synthesis and proposes a SIMD instruction set/functional unit synthesis algorithm. Given an initial assembly code and a timing constraint, the proposed algorithm synthesizes an area-optimized processor core with optimal SIMD functional units, together with its SIMD instruction set. The input assembly code is assumed to run on a full-resource SIMD processor (virtual processor) which has all the possible SIMD functional units. In our algorithm, we introduce SIMD operation decomposition and apply it to the initial assembly code and the full-resource SIMD processor. By gradually removing or decomposing SIMD operations, we finally obtain a processor core with a small area under the given timing constraint. Promising experimental results are also presented.
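
    The decomposition step can be pictured with a small, self-contained sketch: a wide SIMD operation is split into narrower ones whenever the extra cycles still fit the timing constraint, so the functional-unit area shrinks. The operation names, area figures, and one-instruction-per-cycle timing model below are hypothetical illustrations, not the paper's actual algorithm or cost model.

```python
# Minimal sketch of SIMD operation decomposition (hypothetical cost model,
# not the paper's actual algorithm or data structures).

# Each SIMD functional unit is identified by (op, ways): ("add", 8) means an
# 8-way packed adder.  Areas below are made-up illustrative numbers.
AREA = {("add", 8): 8.0, ("add", 4): 4.5, ("add", 2): 2.5, ("add", 1): 1.0,
        ("mul", 4): 12.0, ("mul", 2): 6.5, ("mul", 1): 3.5}

def decompose(op, ways):
    """Split one `ways`-way operation into two (ways // 2)-way operations."""
    assert ways > 1
    return [(op, ways // 2), (op, ways // 2)]

def estimate_cycles(schedule):
    """Crude timing model: one instruction per cycle."""
    return len(schedule)

def synthesize(schedule, cycle_budget):
    """Greedily decompose the largest SIMD unit whose decomposition still
    meets the timing constraint, until no further decomposition fits."""
    while True:
        units = {u for u in schedule}
        candidates = sorted((u for u in units if u[1] > 1),
                            key=lambda u: AREA[u], reverse=True)
        for victim in candidates:
            trial = [d for u in schedule
                       for d in (decompose(*u) if u == victim else [u])]
            if estimate_cycles(trial) <= cycle_budget:
                schedule = trial
                break                  # decomposed one unit, try again
        else:
            break                      # no further decomposition fits timing
    units = {u for u in schedule}
    return schedule, units, sum(AREA[u] for u in units)

if __name__ == "__main__":
    # initial "full-resource" code: two 8-way adds and one 4-way multiply
    code = [("add", 8), ("add", 8), ("mul", 4)]
    print(synthesize(code, cycle_budget=6))
```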

  • Extension and Performance/Accuracy Formulation for Optimal GeAr-Based Approximate Adder Designs

    Ken HAYAMIZU  Nozomu TOGAWA  Masao YANAGISAWA  Youhua SHI  

     
    PAPER

    Vol: E101-A No:7  Page(s): 1014-1024

    Approximate computing is a promising solution for future energy-efficient designs because it can provide great improvements in performance, area, and/or energy consumption over traditional exact-computing designs for non-critical, error-tolerant applications. However, the most challenging issue in designing approximate circuits is how to guarantee the pre-specified computation accuracy while achieving energy reduction and performance improvement. To address this problem, this paper starts from the state-of-the-art general approximate adder model (GeAr) and extends it to cover more approximate design candidates by relaxing the design restrictions. Then, a maximum-error-distance-based performance/accuracy formulation is proposed, which can be used to select the performance/energy-accuracy-optimal design from the extended design space. Our evaluation results show the effectiveness of the proposed method in terms of area overhead, performance, energy consumption, and computation accuracy.
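
    The flavor of a segmented approximate adder and of a maximum-error-distance evaluation can be sketched as follows. The segmentation below (each R-bit result segment speculates its incoming carry from P lower bits) is a simplified stand-in for the GeAr model, and the exhaustive error check is only practical for small bit widths; all parameter names and values are assumptions for illustration.

```python
# Toy segmented approximate adder loosely following the GeAr idea: each R-bit
# result segment is produced by a sub-adder that also sees the P lower
# "previous" bits of the operands to speculate the incoming carry.

def approx_add(a, b, N=8, R=2, P=2):
    result = 0
    for lo in range(0, N, R):          # each R-bit result segment
        start = max(0, lo - P)         # include P speculation bits below
        mask = (1 << (lo + R - start)) - 1
        seg_sum = ((a >> start) & mask) + ((b >> start) & mask)
        # keep only the R result bits of this segment; the carry coming from
        # below bit `start` is assumed zero -> source of approximation error
        result |= ((seg_sum >> (lo - start)) & ((1 << R) - 1)) << lo
    return result

def max_error_distance(N=8, R=2, P=2):
    """Exhaustive maximum |approx - exact| over all N-bit operand pairs
    (the final carry-out is ignored in this toy model)."""
    worst = 0
    for a in range(1 << N):
        for b in range(1 << N):
            exact = (a + b) & ((1 << N) - 1)
            worst = max(worst, abs(approx_add(a, b, N, R, P) - exact))
    return worst

if __name__ == "__main__":
    for P in (1, 2, 4):
        print(f"N=8, R=2, P={P}: max error distance = "
              f"{max_error_distance(8, 2, P)}")
```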

  • A Hardware-Trojan Classification Method Using Machine Learning at Gate-Level Netlists Based on Trojan Features

    Kento HASEGAWA  Masao YANAGISAWA  Nozomu TOGAWA  

     
    PAPER

    Vol: E100-A No:7  Page(s): 1427-1438

    Due to the increase of outsourcing by IC vendors, we face a serious risk that malicious third-party vendors can easily insert hardware Trojans into their IC products. However, detecting hardware Trojans is very difficult because today's ICs are huge and complex. In this paper, we propose a hardware-Trojan classification method for gate-level netlists that identifies hardware-Trojan-infected nets (Trojan nets) using a support vector machine (SVM) or a neural network (NN). First, we extract five hardware-Trojan features from each net in a netlist. Because these feature values are too complicated to separate with simple, fixed threshold values, we then represent them as a five-dimensional vector and learn a classifier using an SVM or an NN. Finally, based on the learned classifiers, we classify all the nets in an unknown netlist into Trojan nets and normal nets. We have applied our machine-learning-based hardware-Trojan classification method to Trust-HUB benchmarks. The results demonstrate that our method increases the true positive rate compared to the existing state-of-the-art results in most of the cases. In some cases, our method achieves a true positive rate of 100%, which shows that all the Trojan nets in an unknown netlist are completely detected.
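
    The classification step can be sketched with scikit-learn as below; the five per-net feature values are placeholders rather than the features actually defined in the paper.

```python
# Sketch of the SVM-based net classification step (scikit-learn).
# The five per-net features here are hypothetical placeholders; the paper
# defines its own Trojan-net features extracted from the gate-level netlist.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Each row: one net, five feature values; label 1 = Trojan net, 0 = normal net.
X_train = np.array([[2, 5, 1, 0, 3],
                    [1, 1, 0, 0, 1],
                    [4, 9, 2, 1, 6],
                    [1, 2, 0, 0, 1],
                    [5, 8, 3, 1, 7],
                    [2, 2, 1, 0, 2]], dtype=float)
y_train = np.array([1, 0, 1, 0, 1, 0])

clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", class_weight="balanced"))
clf.fit(X_train, y_train)

# Classify the nets of an "unknown" netlist.
X_unknown = np.array([[3, 7, 2, 1, 5],
                      [1, 1, 0, 0, 1]], dtype=float)
print(clf.predict(X_unknown))   # 1 -> suspected Trojan net, 0 -> normal net
```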

  • Greedy Algorithm for the On-Chip Decoupling Capacitance Optimization to Satisfy the Voltage Drop Constraint

    Mikiko SODE TANAKA  Nozomu TOGAWA  Masao YANAGISAWA  Satoshi GOTO  

     
    PAPER-Physical Level Design

    Vol: E94-A No:12  Page(s): 2482-2489

    With the progress of process technology in recent years, low-voltage power supplies have become quite predominant. As a result, the voltage margin has decreased, and on-chip decoupling capacitance optimization that satisfies the voltage drop constraint has become more important. In addition, reducing the on-chip decoupling capacitance area reduces the chip area and, therefore, manufacturing costs. Hence, we propose an algorithm that satisfies the voltage drop constraint and, at the same time, minimizes the total on-chip decoupling capacitance area. The proposed algorithm uses a network-algorithm idea in which the path with the most influence on the voltage drop is found. The voltage drop is improved by adding on-chip capacitance to nodes on that path. The proposed algorithm is efficient and effectively adds on-chip capacitance where it has the greatest influence on the voltage drop. Experimental results demonstrate that, with the proposed algorithm, real-size power/ground networks can be optimized in just a few minutes, which is quite practical. Compared with the conventional algorithm, we confirmed that the total on-chip decoupling capacitance area of the power/ground network was reducible by about 40-50%.
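
    A minimal sketch of the greedy idea follows, under a crude linearized model in which adding a unit of decoupling capacitance at a node lowers that node's drop by a fixed, precomputed sensitivity; the node names, drops, and sensitivities are made up for illustration and do not reflect the paper's network model or path search.

```python
# Greedy decap-allocation sketch under a crude linearized model (assumption:
# adding decap at a node reduces that node's voltage drop proportionally to a
# precomputed sensitivity).  All numbers are made-up illustration values.
def greedy_decap(drops, sensitivity, v_limit, cap_step=1.0):
    """drops: node -> current voltage drop [mV];
    sensitivity: node -> drop reduction [mV] per unit of added decap;
    returns node -> total decap added so that all drops <= v_limit."""
    added = {n: 0.0 for n in drops}
    drops = dict(drops)
    while True:
        worst = max(drops, key=drops.get)          # most violating node
        if drops[worst] <= v_limit:
            break                                   # constraint satisfied
        if sensitivity[worst] <= 0:
            raise RuntimeError(f"cannot fix node {worst}")
        added[worst] += cap_step
        drops[worst] -= cap_step * sensitivity[worst]
    return added

if __name__ == "__main__":
    drops = {"n1": 62.0, "n2": 48.0, "n3": 55.0}     # mV
    sens  = {"n1": 2.0,  "n2": 1.5,  "n3": 1.8}      # mV per decap unit
    print(greedy_decap(drops, sens, v_limit=50.0))
```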

  • A Hardware/Software Cosynthesis Algorithm for Processors with Heterogeneous Datapaths

    Yuichiro MIYAOKA  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER

    Vol: E87-A No:4  Page(s): 830-836

    This paper proposes a hardware/software cosynthesis algorithm for processors with heterogeneous registers. Given a CDFG corresponding to an application program and a timing constraint, the algorithm generates a processor configuration minimizing the area of the processor and an assembly code for that processor. First, the algorithm configures a datapath which can execute several data-dependent DFG nodes in one cycle, so that the datapath can execute the application program in the least number of cycles. A branch-and-bound algorithm is then applied in which all candidate numbers of functional units and memory banks are tried. For each assumed number of functional units and memory banks, an appropriate number of heterogeneous registers and their connections to the functional units are explored. The experimental results show the effectiveness and efficiency of the algorithm.
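
    The exploration loop can be sketched as follows: every candidate count of functional units and memory ports is tried, a simple list scheduler checks the cycle constraint, and configurations that cannot beat the best area found so far are pruned. The DFG, resource classes, and area numbers are hypothetical, and the paper's register and connection exploration is not modeled.

```python
# Sketch of the design-space exploration: for each candidate number of ALUs
# and memory ports, list-schedule a small DFG and keep the minimum-area
# configuration meeting the cycle constraint.
import itertools

# DFG: node -> (resource class, list of predecessor nodes)
DFG = {"ld1": ("mem", []), "ld2": ("mem", []),
       "add": ("alu", ["ld1", "ld2"]),
       "mul": ("alu", ["add", "ld2"]),
       "st":  ("mem", ["mul"])}
AREA = {"alu": 3.0, "mem": 5.0}        # made-up area per unit

def list_schedule(n_alu, n_mem):
    """Single-cycle operations, resource-constrained list scheduling."""
    limit = {"alu": n_alu, "mem": n_mem}
    done, cycle = {}, 0
    while len(done) < len(DFG):
        used = {"alu": 0, "mem": 0}
        for node, (res, preds) in DFG.items():
            ready = (node not in done and
                     all(done.get(p, cycle) < cycle for p in preds))
            if ready and used[res] < limit[res]:
                done[node] = cycle
                used[res] += 1
        cycle += 1
    return cycle                        # total number of cycles

def explore(cycle_budget, max_units=3):
    best = None
    for n_alu, n_mem in itertools.product(range(1, max_units + 1), repeat=2):
        area = n_alu * AREA["alu"] + n_mem * AREA["mem"]
        if best and area >= best[0]:
            continue                    # bound: cannot beat current best area
        if list_schedule(n_alu, n_mem) <= cycle_budget:
            best = (area, n_alu, n_mem)
    return best

if __name__ == "__main__":
    print(explore(cycle_budget=4))      # (area, #ALUs, #memory ports)
```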

  • Selective Low-Care Coding: A Means for Test Data Compression in Circuits with Multiple Scan Chains

    Youhua SHI  Nozomu TOGAWA  Shinji KIMURA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER

    Vol: E89-A No:4  Page(s): 996-1004

    This paper presents a test input data compression technique, Selective Low-Care Coding (SLC), which can significantly reduce input test data volume as well as the external test channel requirement for multiscan-based designs. In the proposed SLC scheme, we exploit the linear dependencies of the internal scan chains; instead of encoding all the specified bits in the test cubes, only a smaller set of specified bits is selected for encoding, so greater compression can be expected. Experiments on the larger benchmark circuits show that a drastic reduction in test data volume, with corresponding savings in test application time, can indeed be achieved even for well-compacted test sets.
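
    The underlying "linear dependency" idea can be illustrated with a toy linear decompressor: each scan cell is a GF(2) combination of a few external input bits, so only the selected care bits impose equations that must be solved, while the don't-care cells are left free. The matrix, test cube, and solver below are illustrative assumptions, not the paper's SLC encoding procedure.

```python
# Toy illustration: scan cells driven by a linear (XOR) network of external
# bits; only the selected care bits constrain the compressed input.
import numpy as np

def solve_gf2(A, b):
    """Solve A x = b over GF(2); returns one solution or None."""
    A, b = A.copy() % 2, b.copy() % 2
    rows, cols = A.shape
    pivot_col_of_row, r = [], 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if A[i, c]), None)
        if pivot is None:
            continue
        A[[r, pivot]] = A[[pivot, r]]
        b[[r, pivot]] = b[[pivot, r]]
        for i in range(rows):
            if i != r and A[i, c]:
                A[i] ^= A[r]
                b[i] ^= b[r]
        pivot_col_of_row.append(c)
        r += 1
    if any(b[i] for i in range(r, rows)):      # inconsistent constraints
        return None
    x = np.zeros(cols, dtype=np.uint8)
    for i, c in enumerate(pivot_col_of_row):
        x[c] = b[i]
    return x

# 6 scan cells driven by 4 external bits through an XOR network (toy matrix).
D = np.array([[1, 0, 0, 1],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 0]], dtype=np.uint8)
# Test cube: only cells 0, 3 and 4 are specified (care bits); the rest are X.
care_cells, care_values = [0, 3, 4], np.array([1, 0, 1], dtype=np.uint8)

x = solve_gf2(D[care_cells], care_values)
print("compressed input bits:", x)
print("resulting scan load:  ", (D @ x) % 2)   # matches care bits; X's free
```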

  • A Performance-Oriented Simultaneous Placement and Global Routing Algorithm for Transport-Processing FPGAs

    Nozomu TOGAWA  Masao SATO  Tatsuo OHTSUKI  

     
    PAPER

    Vol: E80-A No:10  Page(s): 1795-1806

    In layout design of transport-processing FPGAs, it is required not only that routing congestion be kept small but also that the circuits implemented on them operate at a high frequency. This paper extends our previously proposed simultaneous placement and global routing algorithm for transport-processing FPGAs, whose objective is to minimize routing congestion, and proposes a new algorithm in which the length of each critical signal path (path length) is limited to a specified upper bound imposed on it (path length constraint). The algorithm is based on hierarchical bipartitioning of layout regions and of the LUT (Look Up Table) sets to be placed. In each bipartitioning, the algorithm first searches for the paths with tighter path length constraints by estimating their path lengths. Second, the algorithm performs the bipartitioning so that the path lengths of critical paths are reduced. The algorithm is applied to transport-processing circuits and compared with conventional approaches. The results demonstrate that the algorithm satisfies the path length constraints for 11 out of 13 circuits, though it increases routing congestion by an average of 20%. After detailed routing, it achieves 100% routing completion for all the circuits and decreases circuit delay by an average of 23%.

  • A Scan-Based Attack Based on Discriminators for AES Cryptosystems

    Ryuta NARA  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-Embedded, Real-Time and Reconfigurable Systems

    Vol: E92-A No:12  Page(s): 3229-3237

    A scan chain is one of the most important testing techniques, but it can also be exploited as a side channel to attack a cryptography LSI. We focus on scan-based attacks, in which scan chains are used for side-channel attacks. Conventional scan-based attacks consider only a scan chain composed solely of the registers in a cryptography circuit. However, a cryptography LSI usually contains many other components, such as memories, microprocessors, and other circuits. This means that the conventional attacks cannot be applied to a practical scan chain composed of various types of registers. In this paper, we propose a scan-based attack which can decipher the secret key of an AES cryptography LSI composed of an AES circuit and other circuits. By focusing on the bit pattern of a specific register and monitoring its changes, our scan-based attack eliminates the influence of the registers included in circuits other than the AES circuit. Our attack does not depend on the scan chain architecture, and it can decipher practical AES cryptography LSIs.

  • Fast Scheduling and Allocation Algorithms for Entropy CODEC

    Katsuharu SUZUKI  Nozomu TOGAWA  Masao SATO  Tatsuo OHTSUKI  

     
    PAPER-High Level Synthesis

    Vol: E80-D No:10  Page(s): 982-992

    Entropy coding/decoding is implemented on FPGAs as a fast and flexible system in which high-level synthesis technologies are key issues. In this paper, we propose scheduling and allocation algorithms for behavioral descriptions of entropy CODECs. The scheduling algorithm takes a control-flow graph as input and finds a solution with minimal hardware cost and execution time by merging nodes in the control-flow graph. The allocation algorithm assigns operations to operators with various bit lengths. As a result, register-transfer-level descriptions are efficiently obtained from behavioral descriptions of entropy CODECs with complicated control flow and variable bit lengths. Experimental results demonstrate that our algorithms synthesize circuits equivalent to manually designed ones within one second.

  • Partially-Parallel LDPC Decoder Achieving High-Efficiency Message-Passing Schedule

    Kazunori SHIMIZU  Tatsuyuki ISHIKAWA  Nozomu TOGAWA  Takeshi IKENAGA  Satoshi GOTO  

     
    PAPER

    Vol: E89-A No:4  Page(s): 969-978

    In this paper, we propose a partially-parallel LDPC decoder which achieves a high-efficiency message-passing schedule. The proposed LDPC decoder is characterized as follows: (i) The column operations follow the row operations in a pipelined architecture to ensure that the row and column operations are performed concurrently. (ii) The proposed parallel pipelined bit functional unit enables the column operation module to compute every message in each bit node which is updated by the row operations. These column operations can be performed without extending the single iterative decoding delay when the row and column operations are performed concurrently. Therefore, the proposed decoder performs the column operations more frequently in a single iterative decoding, and achieves a high-efficiency message-passing schedule within the limited decoding delay time. Hardware implementation on an FPGA and simulation results show that the proposed partially-parallel LDPC decoder improves the decoding throughput and bit error performance with a small hardware overhead.

  • A Highly-Adaptable and Small-Sized In-Field Power Analyzer for Low-Power IoT Devices

    Ryosuke KITAYAMA  Takashi TAKENAKA  Masao YANAGISAWA  Nozomu TOGAWA  

     
    PAPER

    Vol: E99-A No:12  Page(s): 2348-2362

    Power analysis for IoT devices is strongly required to protect them from attacks by malicious attackers. It is also very important to reduce the power consumption of IoT devices themselves. In this paper, we propose a highly-adaptable and small-sized in-field power analyzer for low-power IoT devices. The proposed power analyzer has the following advantages: (A) It realizes signal-averaging noise reduction with synchronization signal lines and thus can reduce noise over a wide frequency range; (B) It partitions a long-term power analysis process into several analysis segments and measures the voltages and currents of each analysis segment using a small amount of data memory. By combining these analysis segments, we obtain long-term analysis results; (C) It has two amplifiers that amplify current signals adaptively depending on their magnitude, so the maximum readable current can be increased while keeping the minimum readable current small enough. Since none of (A), (B), and (C) requires complicated mechanisms or circuits, the proposed power analyzer is implemented on just a 2.5cm×3.3cm board, which is the smallest among existing power analyzers for IoT devices. We have measured the power and energy consumption of the AES encryption process on an IoT device and demonstrated that the proposed power analyzer has measurement errors of at most 1.17% compared to a high-precision oscilloscope.
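
    Ideas (A) and (B) can be sketched numerically: repeated, synchronized current traces are averaged to suppress noise, and the energy integral is accumulated segment by segment so that only one segment of samples needs to be held at a time. The waveform, noise level, and sample rate are made-up illustration values.

```python
# Numerical sketch of (A) signal averaging and (B) segmented energy analysis.
import numpy as np

rng = np.random.default_rng(0)
FS = 1_000_000                 # 1 MS/s sample rate (illustrative)
VDD = 3.3                      # supply voltage [V]
# 100 ms synthetic current waveform: 2 mA baseline with periodic 1 mA bursts
true_current = 0.002 + 0.001 * (np.arange(FS // 10) % 1000 < 200)

def one_noisy_capture():
    return true_current + rng.normal(0, 0.0005, true_current.size)

# (A) signal averaging over repeated, synchronized runs
n_runs = 64
avg_current = np.mean([one_noisy_capture() for _ in range(n_runs)], axis=0)

# (B) split the long run into segments and accumulate energy segment by segment
segment_len = 10_000           # samples per analysis segment
energy = 0.0
for start in range(0, avg_current.size, segment_len):
    seg = avg_current[start:start + segment_len]
    energy += np.sum(VDD * seg) / FS          # E += sum(V * I * dt)

print(f"estimated energy: {energy * 1e3:.3f} mJ")
ideal = np.sum(VDD * true_current) / FS
print(f"ideal energy:     {ideal * 1e3:.3f} mJ")
```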

  • A Bit-Write-Reducing and Error-Correcting Code Generation Method by Clustering ECC Codewords for Non-Volatile Memories

    Tatsuro KOJO  Masashi TAWADA  Masao YANAGISAWA  Nozomu TOGAWA  

     
    PAPER

    Vol: E99-A No:12  Page(s): 2398-2411

    Non-volatile memories are attracting attention as a promising alternative in memory design. However, the data stored in them may still be corrupted by crosstalk and radiation. We can restore the data by using error-correcting codes, which require extra bits to correct bit errors. Furthermore, non-volatile memories consume ten to a hundred times more energy than normal memories in bit-writing. When we configure them using error-correcting codes, it is therefore quite necessary to reduce the number of written bits. In this paper, we propose a method to generate a bit-write-reducing code with error-correcting ability. We first pick an error-correcting code which can correct t-bit errors. We cluster its codewords and generate a cluster graph satisfying the S-bit flip conditions. We then assign a data value to each cluster; in other words, we generate a one-to-many mapping from each data value to the codewords in its cluster. We prove that, if the cluster graph is a complete graph, every data value in a memory cell can be rewritten into any other data value by flipping at most S bits while keeping the t-bit error-correcting ability. We further propose an efficient method to cluster error-correcting codewords. Experimental results show that the bit-write-reducing and error-correcting codes generated by our proposed method efficiently reduce energy consumption. This paper proposes the world's first theoretically near-optimal bit-write-reducing code with error-correcting ability based on efficient coding theories.
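
    The write operation enabled by such a one-to-many mapping can be sketched with a toy clustering: each data value maps to a cluster of codewords, and a write chooses the codeword in the target cluster that differs from the current cell contents in the fewest bits. The 2-bit data values and 6-bit clusters below are hand-made toy values, not codewords produced by the proposed clustering method.

```python
# Toy sketch of writing with a bit-write-reducing code: pick the codeword in
# the target data value's cluster that needs the fewest bit flips.
CLUSTERS = {                      # data value -> list of 6-bit codewords
    0b00: [0b000000, 0b111000],
    0b01: [0b000111, 0b111111],
    0b10: [0b011011, 0b100100],
    0b11: [0b101010, 0b010101],
}

def hamming(a, b):
    return bin(a ^ b).count("1")

def write(current_word, data):
    """Return the new memory word and the number of bit flips it costs."""
    best = min(CLUSTERS[data], key=lambda cw: hamming(current_word, cw))
    return best, hamming(current_word, best)

if __name__ == "__main__":
    word = 0b000000                       # initial cell contents
    for data in (0b01, 0b10, 0b00, 0b11):
        word, flips = write(word, data)
        print(f"write {data:02b}: stored {word:06b} with {flips} bit flips")
```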

  • A Retargetable Simulator Generator for DSP Processor Cores with Packed SIMD-type Instructions

    Nozomu TOGAWA  Kyosuke KASAHARA  Yuichiro MIYAOKA  Jinku CHOI  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-Simulation Accelerator

    Vol: E86-A No:12  Page(s): 3099-3109

    A packed SIMD-type operation, or SIMD operation, consists of n parallel b/n-bit sub-operations executed by the modified n-bit functional unit. Such a functional unit is called a SIMD functional unit, and a processor core which can execute SIMD operations is called a SIMD processor core. SIMD operations can be effectively applied to image processing applications. This paper focuses on hardware/software cosynthesis of SIMD processor cores and, in particular, proposes a new simulator generator which simulates pipelined instructions for a SIMD processor. Generally, a SIMD functional unit has many options, so there can be very many different SIMD functional unit instances. However, since our hardware/software cosynthesis system synthesizes a special-purpose processor core for an input application program, it uses only a very limited set of SIMD functional unit instances. In the proposed approach, we consider a SIMD operation to be a set of SIMD sub-operations; by combining the appropriate SIMD sub-operations, we construct a single SIMD operation. A SIMD functional unit behavior can then be characterized by a collection of SIMD operations. This approach has the advantage that, with a small number of behavior libraries for SIMD sub-operations, we can instantiate any particular SIMD functional unit behavior. Experimental results demonstrate the effectiveness of the proposed approach.
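
    A minimal behavioral model of how a SIMD operation is assembled from sub-operations is sketched below: an n-way packed add on a b-bit word is modeled as n independent b/n-bit lane additions with carries suppressed at lane boundaries. The word and lane widths are example values, not the generator's actual library interface.

```python
# Minimal behavioral model of a packed SIMD add assembled from sub-operation
# behaviors: n independent (b/n)-bit adds with no carry across lanes.
def lane_add(x, y, lane_bits):
    """One sub-operation: add two lane values, wrapping within the lane."""
    return (x + y) & ((1 << lane_bits) - 1)

def simd_add(a, b, word_bits=32, lanes=4):
    lane_bits = word_bits // lanes
    mask = (1 << lane_bits) - 1
    result = 0
    for i in range(lanes):                    # compose n sub-operations
        shift = i * lane_bits
        result |= lane_add((a >> shift) & mask, (b >> shift) & mask,
                           lane_bits) << shift
    return result

if __name__ == "__main__":
    a, b = 0xFF10FF10, 0x01200120
    print(hex(simd_add(a, b)))   # per-lane wrap-around, no carry across lanes
```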

  • A Circuit Partitioning Algorithm with Replication Capability for Multi-FPGA Systems

    Nozomu TOGAWA  Masao SATO  Tatsuo OHTSUKI  

     
    PAPER

    Vol: E78-A No:12  Page(s): 1765-1776

    In circuit partitioning for FPGAs, partitioned signal nets are connected using I/O blocks, through which signals come from or go to external pins. However, the number of I/O blocks per chip is relatively small compared with the number of logic-blocks (which realize logic functions) accommodated in an FPGA chip. Because of this I/O block limitation, the size of the circuit implemented on each FPGA chip is usually small, which leads to a serious decrease in logic-block utilization. It is therefore desirable to utilize unused logic-blocks to reduce the number of required I/O blocks and to realize circuits on the given FPGA chips. In this paper, we propose an algorithm which partitions an initial circuit onto multiple FPGA chips. The algorithm is based on recursive bi-partitioning of a circuit. In each bi-partitioning, it searches for a partitioning position such that each partitioned subcircuit is accommodated in an FPGA chip while making the number of signal nets between chips as small as possible. Such bi-partitioning is achieved by repeatedly computing a minimum cut with a network flow technique and replicating logic-blocks appropriately. Since the set of logic-blocks assigned to each chip is computed separately, the logic-blocks to be replicated are naturally determined. This means that the algorithm makes good use of unused logic-blocks from the viewpoint of reducing the number of signal nets between chips, i.e., the number of required I/O blocks. The algorithm has been implemented and applied to MCNC PARTITIONING 93 benchmark circuits. The experimental results demonstrate that it decreases the maximum number of I/O blocks per chip by up to 49% compared with conventional algorithms.
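
    The network-flow core of each bi-partitioning can be illustrated on a toy graph with networkx: a minimum cut between two seed logic-blocks yields the two chip assignments and the number of cut nets. Logic-block replication and the I/O-block bookkeeping of the actual algorithm are not modeled in this sketch, and the graph and capacities are arbitrary.

```python
# Sketch of the network-flow core: compute a minimum cut between two seed
# logic-blocks to split a tiny netlist graph into two chips.
import networkx as nx

G = nx.DiGraph()
edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"),
         ("c", "d"), ("d", "t"), ("b", "t")]
for u, v in edges:                       # model each net as capacity-1 arcs
    G.add_edge(u, v, capacity=1)
    G.add_edge(v, u, capacity=1)

cut_value, (chip0, chip1) = nx.minimum_cut(G, "s", "t")
print("nets cut (I/O blocks needed):", cut_value)
print("chip 0 logic-blocks:", sorted(chip0))
print("chip 1 logic-blocks:", sorted(chip1))
```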

  • Empirical Evaluation and Optimization of Hardware-Trojan Classification for Gate-Level Netlists Based on Multi-Layer Neural Networks

    Kento HASEGAWA  Masao YANAGISAWA  Nozomu TOGAWA  

     
    LETTER

    Vol: E101-A No:12  Page(s): 2320-2326

    Recently, it has been reported that malicious third-party IC vendors often insert hardware Trojans into their products. Especially in the IC design step, malicious third-party vendors can easily insert hardware Trojans into their products, and thus we have to detect them efficiently. In this paper, we propose a machine-learning-based hardware-Trojan detection method for gate-level netlists using multi-layer neural networks. First, we extract 11 Trojan-net feature values for each net in a netlist. After that, we classify the nets in an unknown netlist into a set of Trojan nets and a set of normal nets using multi-layer neural networks. By experimentally optimizing the structure of the multi-layer neural networks, we obtain an average true positive rate of 84.8% and an average true negative rate of 70.1%, and we achieve a 100% true positive rate on some of the benchmarks, which outperforms the existing methods in most of the cases.
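
    A rough sketch of the classification and structure-optimization step with scikit-learn is given below; the 11 feature values are random placeholders and the candidate hidden-layer sizes are arbitrary examples, not the configurations evaluated in the paper.

```python
# Sketch of the multi-layer neural network classification step.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_nets, n_features = 200, 11
X = rng.normal(size=(n_nets, n_features))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)   # toy "Trojan net" labels

for hidden in [(16,), (32, 16), (64, 32, 16)]:   # candidate NN structures
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=2000,
                        random_state=0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"hidden layers {hidden}: mean CV accuracy = {score:.3f}")
```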

  • A Circuit Partitioning Algorithm with Path Delay Constraints for Multi-FPGA Systems

    Nozomu TOGAWA  Masao SATO  Tatsuo OHTSUKI  

     
    PAPER

    Vol: E80-A No:3  Page(s): 494-505

    In this paper, we extend the circuit partitioning algorithm which we previously proposed for multi-FPGA systems and present a new algorithm in which the delay of each critical signal path is kept within a specified upper bound imposed on it. The core of the presented algorithm is recursive bipartitioning of a circuit. The bipartitioning procedure consists of three stages: 0) detection of critical paths; 1) bipartitioning of the set of primary inputs and outputs; and 2) bipartitioning of the set of logic-blocks. In 0), the algorithm computes lower bounds on the delays of paths with path delay constraints and detects the critical paths dynamically in every bipartitioning step, based on the difference between the lower and upper bounds. The delays of the critical paths are reduced with higher priority. In 1), the algorithm attempts to assign the primary inputs and outputs on each critical path to one chip so that the critical path does not cross between chips. Finally, in 2), the algorithm not only decreases the number of crossings between chips but also assigns the logic-blocks on each critical path to one chip by exploiting a network flow technique. The algorithm has been implemented and applied to MCNC PARTITIONING 93 benchmark circuits. The experimental results demonstrate that it resolves almost all path delay constraints while keeping the maximum number of required I/O blocks per chip small compared with conventional algorithms.

  • A Simultaneous Technology Mapping, Placement, and Global Routing Algorithm for FPGAs with Path Delay Constraints

    Nozomu TOGAWA  Masao SATO  Tatsuo OHTSUKI  

     
    PAPER

    Vol: E79-A No:3  Page(s): 321-329

    In this paper, we propose a new FPGA design algorithm, Maple-opt, in which technology mapping, placement, and global routing are executed so that the delay of each critical signal path in an input circuit is within a specified upper bound imposed on it. The basic algorithm of Maple-opt is top-down hierarchical bi-partitioning of regions. Technology mapping onto the logic-blocks of FPGAs, their placement, and global routing are determined simultaneously in each hierarchical step. This simultaneity leads to a less congested layout for routing. In addition, Maple-opt computes a lower bound on the delay of each path with a constraint value and determines critical paths dynamically in each hierarchical step, based on the difference between the lower bound and the constraint value. Two delay reduction processes are executed for the critical paths: one is routing delay reduction and the other is logic-block delay reduction. Routing delay reduction is realized such that, when bi-partitioning a region, each constrained path is assigned to one subregion. Logic-block delay reduction is realized such that each constrained path is mapped onto fewer logic-blocks. Experimental results for benchmark circuits show the efficiency and effectiveness of Maple-opt.
